
    Analysis of the Web Graph Aggregated by Host and Pay-Level Domain

    In this paper, the web is analyzed as a graph aggregated by host and by pay-level domain (PLD). The publicly available web graph datasets were released by the Common Crawl Foundation and are based on a web crawl performed during May, June, and July 2017. The host graph has ~1.3 billion nodes and ~5.3 billion arcs; the PLD graph has ~91 million nodes and ~1.1 billion arcs. We study the distributions of degree and of the sizes of strongly/weakly connected components (SCC/WCC), focusing on the detection of power laws using statistical methods. The statistical plausibility of the power-law model is compared with that of several alternative distributions. While there is no evidence of power-law tails at the host level, they emerge under PLD aggregation for the indegree, SCC size, and WCC size distributions. Finally, we analyze distance-related features by studying the cumulative distributions of shortest path lengths, and give an estimate of the diameters of the graphs.
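    The abstract does not name its tooling, but the kind of statistical power-law test it describes (fitting the tail of a degree distribution and comparing the power law against alternative distributions) can be sketched with the Python powerlaw package; the degree list below is placeholder data, not the Common Crawl graph.

        # Minimal sketch of a power-law tail test on an (in)degree sequence,
        # assuming the Python `powerlaw` package; the data here is a placeholder.
        import powerlaw

        # In practice this would be the indegree sequence of the PLD graph.
        degrees = [1, 1, 2, 2, 3, 5, 8, 13, 40, 120, 3500]

        fit = powerlaw.Fit(degrees, discrete=True)      # degrees are integers
        print("estimated exponent alpha:", fit.power_law.alpha)
        print("estimated tail cutoff xmin:", fit.power_law.xmin)

        # Log-likelihood ratio test against an alternative distribution:
        # R > 0 favours the power law, p gives the significance of the comparison.
        R, p = fit.distribution_compare("power_law", "lognormal")
        print("power law vs lognormal: R =", R, ", p =", p)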

    Enriching product ads with Metadata from HTML annotations


    Micro-CernVM: Slashing the Cost of Building and Deploying Virtual Machines

    The traditional virtual machine building and deployment process is centered around the virtual machine hard disk image. The packages comprising the VM operating system are carefully selected, hard disk images are built for a variety of different hypervisors, and images have to be distributed and decompressed in order to instantiate a virtual machine. Within the HEP community, the CernVM File System has been established in order to decouple the distribution of the experiment software from the building and distribution of the VM hard disk images. We show how to get rid of such pre-built hard disk images altogether. Due to the high requirements on POSIX compliance imposed by HEP application software, CernVM-FS can also be used to host and boot a Linux operating system. This allows the use of a tiny bootable CD image that comprises only a Linux kernel, while the rest of the operating system is provided on demand by CernVM-FS. This approach speeds up the initial instantiation time and reduces virtual machine image sizes by an order of magnitude. Furthermore, security updates can be distributed instantaneously through CernVM-FS. By leveraging the fact that CernVM-FS is a versioning file system, a historic analysis environment can easily be re-spawned by selecting the corresponding CernVM-FS file system snapshot. Comment: Conference paper at the 2013 Computing in High Energy Physics (CHEP) Conference, Amsterdam.

    CernVM Online and Cloud Gateway: a uniform interface for CernVM contextualization and deployment

    In a virtualized environment, contextualization is the process of configuring a VM instance for the needs of various deployment use cases. Contextualization in CernVM can be done by passing a handwritten context to the user data field of cloud APIs, when running CernVM in a cloud, or by using the CernVM web interface when running the VM locally. CernVM Online is a publicly accessible web interface that unifies these two procedures. A user is able to define, store, and share CernVM contexts using CernVM Online and then apply them either in a cloud, by using CernVM Cloud Gateway, or on a local VM with the single-step pairing mechanism. CernVM Cloud Gateway is a distributed system that provides a single interface to multiple, different clouds (by location or type, private or public). So far, Cloud Gateway has been integrated with the OpenNebula, CloudStack, and EC2 tools interfaces. A user with access to a number of clouds can run CernVM cloud agents that communicate with these clouds through their interfaces, and then use one single interface to deploy and scale CernVM clusters. CernVM clusters are defined in CernVM Online and consist of a set of CernVM instances that are contextualized and can communicate with each other. Comment: Conference paper at the 2013 Computing in High Energy Physics (CHEP) Conference, Amsterdam.
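    The basic mechanism of passing a handwritten context through the user-data field of a cloud API can be illustrated with a minimal sketch; the snippet below uses boto3 against EC2 with a placeholder image id and a placeholder context document, and does not involve CernVM Online or Cloud Gateway themselves.

        # Minimal sketch of contextualization via the cloud user-data field,
        # assuming boto3/EC2; the AMI id and the context text are placeholders.
        import boto3

        context = "[cernvm]\n# handwritten contextualization would go here\n"  # placeholder

        ec2 = boto3.client("ec2", region_name="eu-west-1")
        response = ec2.run_instances(
            ImageId="ami-00000000",     # placeholder CernVM image id
            InstanceType="m5.large",
            MinCount=1,
            MaxCount=1,
            UserData=context,           # the contextualization document
        )
        print(response["Instances"][0]["InstanceId"])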

    Opportunities for Nuclear Astrophysics at FRANZ

    The "Frankfurter Neutronenquelle am Stern-Gerlach-Zentrum" (FRANZ), which is currently under development, will be the strongest neutron source in the astrophysically interesting energy region in the world. It will be about three orders of magnitude more intense than the well-established neutron source at the Research Center Karlsruhe (FZK)

    Verification and Validation of Semantic Annotations

    In this paper, we propose a framework to perform verification and validation of semantically annotated data. The annotations, extracted from websites, are verified against the schema.org vocabulary and Domain Specifications to ensure the syntactic correctness and completeness of the annotations. The Domain Specifications allow checking the compliance of annotations against corresponding domain-specific constraints. The validation mechanism detects errors and inconsistencies between the content of the analyzed schema.org annotations and the content of the web pages where the annotations were found. Comment: Accepted for the A.P. Ershov Informatics Conference 2019 (the PSI Conference Series, 12th edition) proceedings.
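    The framework's Domain Specifications are not reproduced here, but the flavour of such a check can be shown with a simplified, hypothetical sketch in plain Python: parse a schema.org annotation (JSON-LD) and verify the expected type and a few required properties.

        # Simplified, hypothetical domain-specific check on a schema.org annotation;
        # the real framework uses full Domain Specifications, this only shows the idea.
        import json

        annotation = json.loads("""
        {
          "@context": "https://schema.org",
          "@type": "Hotel",
          "name": "Alpine Example Hotel",
          "address": {"@type": "PostalAddress", "addressLocality": "Innsbruck"}
        }
        """)

        # Hypothetical mini domain specification for Hotel annotations.
        domain_spec = {"type": "Hotel", "required": ["name", "address", "telephone"]}

        errors = []
        if annotation.get("@type") != domain_spec["type"]:
            errors.append("unexpected @type: %s" % annotation.get("@type"))
        for prop in domain_spec["required"]:
            if prop not in annotation:
                errors.append("missing required property: %s" % prop)

        print(errors or "annotation satisfies the simplified domain specification")
        # -> ['missing required property: telephone']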

    Prototype Testing of the Frankfurt Gabor Lens at HOSTI


    Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis

    More and more websites embed structured data describing, for instance, products, reviews, blog posts, people, organizations, events, and cooking recipes into their HTML pages using markup standards such as Microformats, Microdata, and RDFa. This development has accelerated in the last two years as major Web companies, such as Google, Facebook, Yahoo!, and Microsoft, have started to use the embedded data within their applications. In this paper, we analyze the adoption of RDFa, Microdata, and Microformats across the Web. Our study is based on a large public Web crawl dating from early 2012 and consisting of 3 billion HTML pages which originate from over 40 million websites. The analysis reveals the deployment of the different markup standards, the main topical areas of the published data, as well as the different vocabularies that are used within each topical area to represent the data. What distinguishes our work from earlier studies, published by the large Web companies, is that the analyzed crawl as well as the extracted data are publicly available. This allows our findings to be verified and to be used as starting points for further domain-specific investigations as well as for focused information extraction endeavors.
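    The study's large-scale extraction pipeline is not reproduced here, but the markup cues it counts can be detected in a single HTML page with a minimal, standard-library sketch: Microdata is signalled by itemscope/itemtype attributes and RDFa by typeof/property attributes (Microformats, which rely on class names, are omitted); the HTML below is a made-up example.

        # Minimal sketch: detect Microdata and RDFa markup in one HTML page using
        # only the standard library; the input HTML is a made-up example.
        from html.parser import HTMLParser

        class MarkupDetector(HTMLParser):
            def __init__(self):
                super().__init__()
                self.microdata_types = []   # values of itemtype (Microdata)
                self.rdfa_types = []        # values of typeof (RDFa)

            def handle_starttag(self, tag, attrs):
                attrs = dict(attrs)
                if "itemscope" in attrs and attrs.get("itemtype"):
                    self.microdata_types.append(attrs["itemtype"])
                if attrs.get("typeof"):
                    self.rdfa_types.append(attrs["typeof"])

        html = (
            '<div itemscope itemtype="http://schema.org/Product">'
            '  <span itemprop="name">Example phone</span></div>'
            '<p vocab="http://schema.org/" typeof="Review">'
            '  <span property="reviewBody">Works fine.</span></p>'
        )

        detector = MarkupDetector()
        detector.feed(html)
        print("Microdata types:", detector.microdata_types)   # ['http://schema.org/Product']
        print("RDFa types:", detector.rdfa_types)             # ['Review']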